Q1
library(tidyverse)
setwd("C:/Users/User/Documents/GitHub/biomarkers-group9")
Warning: The working directory was changed to C:/Users/User/Documents/GitHub/biomarkers-group9 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
## 1. Get data
# get names: the first two rows hold the protein names and abbreviations
var_names <- read_csv('data/biomarker-raw.csv',
                      col_names = FALSE,
                      n_max = 2,
                      col_select = -(1:2)) %>%
  t() %>%
  as_tibble() %>%
  rename(name = V1,
         abbreviation = V2) %>%
  na.omit()
Rows: 2 Columns: 1318
── Column specification ─────────────────────────────────────────
Delimiter: ","
chr (1318): X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
# function for trimming outliers: values beyond +/- .at are set to +/- .at (good idea??)
trim <- function(x, .at){
  x[abs(x) > .at] <- sign(x[abs(x) > .at])*.at
  return(x)
}
# read in data
biomarker_raw <- read_csv('data/biomarker-raw.csv',
                          skip = 2,
                          col_select = -2L,
                          col_names = c('group',
                                        'empty',
                                        pull(var_names, abbreviation),
                                        'ados'),
                          na = c('-', '')) %>%
  filter(!is.na(group)) %>%
  # reorder columns
  select(group, ados, everything())
Rows: 155 Columns: 1319
── Column specification ─────────────────────────────────────────
Delimiter: ","
chr (1): group
dbl (1318): CHIP, CEBPB, NSE, PIAS4, IL-10 Ra, STAT3, IRF1, c...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# export as r binary
save(list = 'biomarker_raw',
     file = 'data/biomarker-raw.RData')
biomarker_raw
biomarker_clean
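
The chunk above prints `biomarker_clean`, but the code that builds it is not shown here. Given the `trim()` helper and the clean values described in the write-up (centered at 0, within -3 to 3), one plausible reading is a log10 transform, then centering and scaling, then trimming at 3. The sketch below applies that assumed pipeline to a small simulated table. The names `toy_raw`, `prot_a`, `prot_b`, the group labels and the cutoff of 3 are illustrative assumptions, not the original code.

```r
library(dplyr)

# Same idea as the trim() helper above: cap values at +/- .at.
trim <- function(x, .at) {
  out_of_range <- abs(x) > .at
  x[out_of_range] <- sign(x[out_of_range]) * .at
  x
}

set.seed(1)
# Hypothetical stand-in for biomarker_raw: two groups, an ados score,
# and two right-skewed (log-normal) protein columns.
toy_raw <- tibble(
  group  = rep(c("A", "B"), each = 10),
  ados   = sample(6:20, 20, replace = TRUE),
  prot_a = rlnorm(20, meanlog = 7, sdlog = 1),
  prot_b = rlnorm(20, meanlog = 9, sdlog = 0.5)
)

# Assumed cleaning step: log10, center and scale, then trim at 3.
# scale() returns a one-column matrix, so [, 1] turns it back into a vector.
toy_clean <- toy_raw %>%
  mutate(across(-c(group, ados),
                ~ trim(scale(log10(.x))[, 1], .at = 3)))

summary(toy_clean$prot_a)
```

If the real cleaning step differs (for example a natural log or another cutoff), only the function inside `across()` needs to change.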

## 3. Plot the distribution for single variables
set.seed(1234)
# sample 5 column positions; note that the range 2:1319 also includes column 2 (ados)
n2 <- sample(2:1319, size = 5)
## Select the sampled variables
## For raw data
raw_var <- biomarker_raw %>% select(all_of(n2))
## For clean data
clean_var <- biomarker_clean %>% select(all_of(n2))
## Pull the names
col.name <- colnames(raw_var)
## plot the histograms for single variable distribution
for (i in seq_along(col.name)) {
  par(mfrow = c(1, 2))
  ## Raw
  hist(raw_var[[i]], main = paste(col.name[i], "(raw)"), xlab = col.name[i])
  ## Clean
  hist(clean_var[[i]], main = paste(col.name[i], "(clean)"), xlab = col.name[i])
}

In this part, we draw histograms of the data before and after the log
transformation and compare the two sets of distributions to see how the
transformation changes their shape.

In step 2, we take the mean of each of 650 randomly selected variables,
before and after the log transformation, and compare the distributions
of those means as a summary of the whole data set. In the two
histograms of means, the raw means are clearly highly right-skewed, and
their range is very large even after restricting the plot with xlim.
After the log transformation, the range of the clean means is much
smaller (from about -0.06 to 0.05), the distribution is centered near
x = 0, and it is much less right-skewed than the raw distribution.
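
The step 2 comparison can be sketched on simulated data as follows. This is a minimal illustration, not the original chunk: `toy_raw`, the number of variables, and the cleaning step (log10, standardize, trim at 3) are all assumptions chosen to mimic the description above.

```r
set.seed(123)

# Hypothetical stand-in for the raw protein columns: 200 log-normal variables
# whose typical level differs by orders of magnitude, 50 samples each.
n_vars <- 200
log_levels <- runif(n_vars, min = 4, max = 10)
toy_raw <- sapply(log_levels, function(m) rlnorm(50, meanlog = m, sdlog = 0.7))
colnames(toy_raw) <- paste0("prot_", seq_len(n_vars))

# Assumed cleaning step: log10, center and scale each column, trim at +/- 3.
toy_clean <- scale(log10(toy_raw))
toy_clean[toy_clean > 3] <- 3
toy_clean[toy_clean < -3] <- -3

# Randomly select half of the variables and take the mean of each one.
picked <- sample(colnames(toy_raw), size = n_vars / 2)
raw_mean <- colMeans(toy_raw[, picked], na.rm = TRUE)
clean_mean <- colMeans(toy_clean[, picked], na.rm = TRUE)

# Side-by-side histograms of the per-variable means.
par(mfrow = c(1, 2))
hist(raw_mean, breaks = 50, main = "Histogram of raw_mean")
hist(clean_mean, main = "Histogram of clean_mean")
```

In this sketch the clean means are close to zero mainly because each column is centered; trimming is the only thing that moves them away from exactly zero.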

In step 3, we randomly select 5 proteins to see how the distribution of
a single variable changes under the log transformation. As with the
means above, the first four raw distributions are right-skewed with
large ranges, and the last variable, CHL1, is slightly right-skewed.
After the transformation, most of the new distributions look roughly
like a standard normal distribution: centered at x = 0, with a range of
about -3 to 3. The exception is hnRNP K, whose transformed distribution
is still right-skewed because its original distribution was so strongly
skewed.
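
The visual impression of skewness can also be checked with a number. The sketch below computes a simple sample skewness before and after a log10 transform on one simulated right-skewed variable; the `skewness()` helper and the simulated `raw_level` are illustrative assumptions, not values from the biomarker data.

```r
# Sample skewness: third central moment divided by the cubed
# (population-form) standard deviation. Written by hand to avoid
# depending on an extra package.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1234)
# Hypothetical right-skewed protein level (simulated log-normal values).
raw_level <- rlnorm(1000, meanlog = 8, sdlog = 1)

skew_raw <- skewness(raw_level)
skew_log <- skewness(log10(raw_level))

# Clearly positive for the raw values, near zero after the transform.
c(raw = skew_raw, log10 = skew_log)
```

Applied column by column to the real data, the same helper would flag variables such as hnRNP K whose skewness stays well above zero after the transformation.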

Therefore, the log transformation takes our data from highly skewed
distributions to approximately normal ones and greatly reduces the
range of the data set. This has several advantages. First, with a
smaller range, the means and variances of the different variables fall
within a narrow band, which makes them easier to inspect and compare.
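
To make the point about means and variances concrete, the sketch below compares three simulated variables on very different raw scales. It assumes the cleaning also standardizes each column after the log10 transform, which is what would put every variable at mean 0 and standard deviation 1; the matrix `toy` and the helper `summarise_cols()` are hypothetical names.

```r
set.seed(42)
# Three hypothetical proteins on very different raw scales (simulated).
toy <- cbind(
  low  = rlnorm(100, meanlog = 5,  sdlog = 0.5),
  mid  = rlnorm(100, meanlog = 8,  sdlog = 0.8),
  high = rlnorm(100, meanlog = 11, sdlog = 1.0)
)

# Column-wise mean and standard deviation as a 2 x 3 matrix.
summarise_cols <- function(m) {
  rbind(mean = colMeans(m), sd = apply(m, 2, sd))
}

raw_summary <- summarise_cols(toy)                 # scales differ by orders of magnitude
log_summary <- summarise_cols(log10(toy))          # same units, still different centers
std_summary <- summarise_cols(scale(log10(toy)))   # mean 0 and sd 1 in every column

round(raw_summary, 1)
round(log_summary, 2)
round(std_summary, 2)
```

The log step compresses the spread between variables, while the standardization step is what actually puts them on a common center and scale.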

More importantly, if we later want to fit regression models to these
data, the raw values have some disadvantages. When a predictor is
strongly skewed or related to the response non-linearly, the model's
errors can be skewed as well, and a few extreme values can dominate the
fit. We want the prediction error to be as small as possible without
overfitting the model. Overfitting occurs when the model uses so many
predictors that it fits the noise in the training data and does not
generalize well enough to make valid predictions. The transformed data
reduce the influence of extreme values, which can lower prediction
error and may make overfitting to a few outlying observations less
likely. Thus, transforming one or more variables can improve the fit of
the model by bringing the distribution of the features closer to a
bell-shaped normal curve.
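
A small simulation illustrates how a log transform can improve a regression fit. It rests on a strong assumption made purely for illustration: the response is generated to be linear in the log of the predictor. The variables `protein` and `response` are simulated, so the result says nothing about the biomarker data itself.

```r
set.seed(2022)
n <- 200

# Simulated predictor with a right-skewed (log-normal) distribution.
protein <- rlnorm(n, meanlog = 0, sdlog = 1)

# Assumption for illustration: the response is linear in log(protein).
response <- 2 * log(protein) + rnorm(n, sd = 0.5)

# Fit the same linear model on the raw and on the log-transformed predictor.
fit_raw <- lm(response ~ protein)
fit_log <- lm(response ~ log(protein))

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared

# The log-scale model explains more of the variance in this setup.
c(raw = r2_raw, log = r2_log)
```

When the true relationship is not on the log scale, the comparison can go the other way, so it is worth checking residual plots for both fits rather than assuming the transform always helps.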